
Implement 4over6 NVFP4 recipe #2972

Open

zianglih wants to merge 57 commits into NVIDIA:main from zianglih:4over6

Conversation

@zianglih (Contributor) commented May 9, 2026

Description


Implement 4over6 nvfp4 from:

FlashInfer PR:

Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both the original per-tensor scaling and the row-scaled NVFP4 introduced by #2931 are supported.
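For intuition, here is a minimal NumPy sketch of the per-block selection rule (illustrative only: the function names are hypothetical, the real kernels operate on E4M3-encoded block scales under a global tensor scale, and true E2M1 rounding uses round-to-nearest-even rather than this simplified nearest-grid lookup):

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def _quantize_block(block, scale):
    """Round each element to the nearest FP4 magnitude at the given decode scale."""
    mag = np.abs(block) / scale
    nearest = FP4_GRID[np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)]
    return np.sign(block) * nearest * scale

def quantize_4over6_block(block, err_mode="mse"):
    """Pick the lower-error candidate per 1x16 block; ties keep map-to-6."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    cand6 = _quantize_block(block, amax / 6.0)  # map-to-6: amax lands on FP4 value 6
    cand4 = _quantize_block(block, amax / 4.0)  # map-to-4: 1.5x expanded decode scale
    err = np.square if err_mode == "mse" else np.abs
    e6, e4 = err(cand6 - block).sum(), err(cand4 - block).sum()
    return cand6 if e6 <= e4 else cand4

# Example: one 16-element block.
out = quantize_4over6_block(np.random.randn(16).astype(np.float32))
```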

This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds scoped NVFP4 4over6 control through NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and the C++ tensor/config APIs (a resolution sketch follows this list).
  • Implements 1D & 2D NVFP4 4over6 quantization in the existing NVFP4 CUDA paths by comparing TE-style map-to-4 and map-to-6 FP4 candidates with the original 4over6 MSE rule, choosing map-to-6 on ties, honoring NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as stochastic rounding, grouped tensors, and RHT.
  • Updates dequantization and NVFP4 GEMM scaling to respect per-tensor 4over6 metadata, using 256-based normalization for 4over6 tensors and 448-based normalization for regular NVFP4 tensors without requiring callers to do hidden rescaling.
  • Extends the Python reference implementation to mirror the intended ground truth, meaning TE-style candidate quantization plus original 4over6 MSE/compare logic, and uses this reference for bitwise exact tests where fast math is disabled.
  • Expands C++ and Python coverage across exact NVFP4 quantization, GEMM, dequantization, recipe scope resolution, quantized tensor handling, numerics, sanity, CUDA graph, torch compile, CPU offload, fusible ops, and backward override paths, while documenting the new environment variable and known unsupported modes.
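A hedged sketch of how the scope variable could resolve. Only the variable name NVTE_NVFP4_4OVER6 and its weights|activations|all values come from the PR; the helper name and return shape are assumptions for illustration:

```python
import os

def resolve_nvfp4_4over6_scope():
    """Hypothetical helper: map NVTE_NVFP4_4OVER6 to per-tensor-kind flags."""
    scope = os.getenv("NVTE_NVFP4_4OVER6")  # unset preserves existing behavior
    if scope is None:
        return {"weights": False, "activations": False}
    if scope not in ("weights", "activations", "all"):
        raise ValueError(f"NVTE_NVFP4_4OVER6 must be weights|activations|all, got {scope!r}")
    return {
        "weights": scope in ("weights", "all"),
        "activations": scope in ("activations", "all"),
    }
```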

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zianglih zianglih marked this pull request as draft May 9, 2026 03:50
@zianglih zianglih changed the title from "Implement 4over6 nvfp4" to "Implement 4over6 nvfp4 recipe" May 9, 2026
@zianglih zianglih changed the title from "Implement 4over6 nvfp4 recipe" to "Implement 4over6 NVFP4 recipe" May 9, 2026
@greptile-apps (Bot) commented May 9, 2026

Greptile Summary

This PR adds NVFP4 4over6 quantization support to TransformerEngine's NVFP4BlockScaling recipe. For each 1x16 block, it quantizes with both a map-to-4 candidate (1.5x expanded scale) and a map-to-6 candidate (normal scale), then selects the lower-error option (tie goes to map-to-6), mirroring the approach from the fouroversix paper/repo.

  • Adds a new CUDA kernel (quantize_4over6_nvfp4.cuh) with pipeline-staged shared-memory loading, per-block candidate comparison (MAE or MSE error), 2D/1D support, and row-scaled amax support; pairs it with updated dequantization and GEMM scale kernels that parameterize the E4M3 normalization divisor per tensor.
  • Threads the 4over6 mode, E4M3 max bound, and error-mode metadata through the full Python-to-C++ stack with explicit rejection guards for stochastic rounding, RHT, and grouped tensors; extends the Python reference quantizer and adds broad test coverage (a minimal guard sketch follows this list).
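A minimal sketch of those rejection guards (hypothetical shape; the actual checks live in the C++ layers listed in the files table below):

```python
def assert_4over6_supported(*, stochastic_rounding: bool, with_rht: bool, grouped: bool):
    """Hypothetical guard mirroring the rejections described above."""
    unsupported = {
        "stochastic rounding": stochastic_rounding,
        "RHT": with_rht,
        "grouped tensors": grouped,
    }
    enabled = [name for name, flag in unsupported.items() if flag]
    if enabled:
        raise ValueError(f"NVFP4 4over6 does not support: {', '.join(enabled)}")
```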

Confidence Score: 5/5

The PR is safe to merge. The 4over6 feature is entirely opt-in and isolated behind the new nvfp4_4over6 recipe field; all existing NVFP4 paths are untouched when the flag is unset.

The kernel implementation is well-guarded with explicit rejection of stochastic rounding, RHT, and grouped quantization at multiple call-site layers. The two findings are limited to an inconsistency in a property setter that all current internal call sites avoid, and a missing secondary validation inside quantize_4over6 that is already covered by quantize_fwd_helper for all real callers.

The new quantize_4over6_nvfp4.cuh kernel and the grouped_tensor_storage.py property setter are the two places worth a careful second read.

Important Files Changed

  • transformer_engine/common/cast/nvfp4/quantize_4over6_nvfp4.cuh: New 671-line CUDA kernel implementing the 4over6 candidate comparison, with pipeline-staged async shared-memory loads, correct 2D warp-level reductions, and a proper sm_100 guard. Missing output->nvfp4_4over6 == true validation when called directly (bypassing quantize_fwd_helper).
  • transformer_engine/common/cast/dispatch/quantize.cuh: Adds tensor/config consistency checks and dispatches to quantize_4over6 when nvfp4_4over6 is set; all three dispatch sites correctly guard and branch.
  • transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh: Parameterizes E4M3_MAX as a compile-time template argument; ROW_SCALED_NVFP4 is also promoted to a template parameter; correct dispatch for both the 256 and 448 bounds.
  • transformer_engine/common/recipe/nvfp4.cu: Correctly plumbs per-tensor fp8_max_A/fp8_max_B into the GEMM scale kernel instead of the old hardcoded 448 constant.
  • transformer_engine/pytorch/csrc/quantizer.cpp: The NVFP4Quantizer constructor and quantize_impl correctly thread nvfp4_use_4over6, nvfp4_e4m3_max, nvfp4_4over6_err_mode, and err_use_fast_math through quant_config; RHT and stochastic-rounding guards are present.
  • transformer_engine/pytorch/csrc/extensions/cast.cpp: All split-quantize helpers and bulk-alloc paths propagate 4over6 metadata; grouped and RHT paths correctly reject 4over6 with clear error messages.
  • transformer_engine/pytorch/tensor/nvfp4_tensor.py: nvfp4_use_4over6 and nvfp4_e4m3_max are threaded through new, copy, reduce_ex, all-gather metadata, and the view/reshape autograd functions.
  • transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py: Adds nvfp4_use_4over6 and nvfp4_e4m3_max properties; the nvfp4_e4m3_max setter does not replicate the nvfp4_use_4over6 guard from _initialize_storage_fields.
  • transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py: The Python reference 4over6 quantizer correctly mirrors the CUDA candidate-compare logic; dimension handling for the row_scaled and 2D cases is consistent with the kernel.
  • transformer_engine/common/include/transformer_engine/transformer_engine.h: New kNVTENVFP44Over6 (9) and kNVTENVFP4E4M3Max (10) tensor params plus four new QuantizationConfigAttributes; getter/setter helpers correctly encode/decode values.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Python Recipe NVFP4BlockScaling nvfp4_4over6 scope] --> B[NVFP4BlockScalingRecipeState resolve nvfp4_use_4over6]
    B --> C[NVFP4Quantizer nvfp4_use_4over6 bool nvfp4_e4m3_max int]
    C --> D{quantize path}
    D -- single tensor --> E[quantize_impl set quant_config fields reject RHT + stochastic_rounding]
    D -- split quant --> F[split_quantize_nvfp4_impl reject RHT set per-config fields]
    D -- grouped --> G[group_quantize_nvfp4_impl reject 4over6]
    E --> H[quantize_fwd_helper check tensor/config consistency]
    F --> H
    H -- nvfp4_use_4over6=true --> I[quantize_4over6 E4M3_MAX switch ErrMode switch]
    H -- nvfp4_use_4over6=false --> J[existing quantize_transpose kernels]
    I --> K[quantize_4over6_kernel load tile async compute ScalePair map4+map6 pick lower error write selected scale+data]
    K --> L[NVFP4Tensor _nvfp4_use_4over6 _nvfp4_e4m3_max]
    L --> M[dequantize_fp4_kernel E4M3_MAX template]
    L --> N[nvte_nvfp4_compute_per_tensor_scale fp8_max from tensor]
```

Reviews (8): Last reviewed commit: "Remove 4over6 benchmark"

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated
Comment thread transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu Outdated
Comment thread transformer_engine/common/recipe/__init__.py
Comment thread tests/pytorch/test_sanity.py Outdated
@zianglih (Contributor Author) commented May 11, 2026

Functionality has been verified by internal RL experiments.
We may want to allow separate 4over6 config for weights and activations, maybe NVTE_NVFP4_ENABLE_4OVER6=weights|activations|all.

@ptrendx ptrendx requested a review from negvet May 11, 2026 17:12
@ptrendx ptrendx added the community-contribution (PRs from external contributors outside the core maintainers, representing community-driven work) and fp4 labels May 11, 2026
@zianglih (Contributor Author) commented:

Need to rebase.

@zianglih zianglih marked this pull request as draft May 11, 2026 21:17
@zianglih zianglih marked this pull request as ready for review May 11, 2026 22:36
From transformer_engine/common/include/transformer_engine/transformer_engine.h:

```c++
 * its values are populated during quantization.
 */
kNVTERowScaledNVFP4 = 8,
kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */
```
@timmoon10 (Collaborator) commented May 11, 2026:

We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.

@zianglih (Contributor Author) replied:

4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
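Concretely, a hypothetical helper illustrating the convention stated above (the tensor scale follows Equation 1 of the paper quoted further down):

```python
def per_tensor_decode_scale(amax: float, e4m3_max: float) -> float:
    # e4m3_max is 448 for regular NVFP4 and 256 for 4over6 tensors, so a
    # consumer cannot decode correctly without knowing which bound was used.
    return amax / (6.0 * e4m3_max)
```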

```diff
 using namespace detail;
-constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
+constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;
 constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
```
A collaborator commented:

How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.

If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.

@zianglih (Contributor Author) replied:

From the original paper:

Finally, we make one modification to the computation of the tensor scale α (Equation 1) when quantizing to NVFP4 with 4/6. When M_FP4 × M_FP8 is used to compute the tensor scale, it ensures that all quantized values will be less than 6 × 448. However, this makes it impossible to select a scale of 4 for the blocks that contain a tensor's largest values, because the block's scale would need to be 448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by E4M3. As a result, when computing the tensor scale, we replace M_FP8 with 256 in Equation 1, since 256 is the largest E4M3 value that can be multiplied by 6/4 and represented without error in E4M3, as 384.

Also:

In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8
E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have
the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit
over using the standard tensor scale calculation. Even though this adjustment only affects a small
number of large values, this performance gain may come from the fact that larger activation values
can have an outsize impact on model performance. This adjustment is incorporated into the remaining
experiments in this section.
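The arithmetic in these passages is easy to sanity-check with PyTorch's float8_e4m3fn dtype (an illustrative check, not part of the PR; out-of-range values fail the round trip regardless of whether the cast saturates or produces NaN):

```python
import torch

def exact_in_e4m3(x: float) -> bool:
    """True if x survives a round trip through FP8 E4M3 unchanged."""
    return torch.tensor(x).to(torch.float8_e4m3fn).float().item() == x

assert not exact_in_e4m3(448.0 * 6 / 4)  # 672 overflows E4M3, whose max is 448
assert exact_in_e4m3(256.0 * 6 / 4)      # 384 = 256 * 1.5 is exact, so 256 works
```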

A collaborator replied:

Not sure if there are internal or external studies about the convergence, but this is required to make it work. We need the largest value below 448/1.5 such that both the value itself and its product with 1.5 are represented exactly in E4M3. This helps avoid quantization noise on both the map-to-4 and map-to-6 paths.

A collaborator replied:

We did find that using 256 to calculate the second-level scaling factor helped convergence versus 448, but only slightly.

It's possible that the premise of the paper's argument (preventing saturation when the 4 scaling effectively multiplies the block decode scale by 1.5) is sound, but that a value larger than 256 can achieve this, and that perfectly representing the block containing the global amax under both scalings is not worth the extra range loss.

@zianglih (Contributor Author) replied:

Let me make 256 scaling a separate env var, disabled by default.

@zianglih (Contributor Author) replied:

448, 320, 288, 256 are all potential candidates for map-to-6:

  • 448: effectively disable map-to-4 option above 256, preserve range
  • 320, 288: map-to-4 uses 448, no precise 1.5x
  • 256: map-to-4 uses 384, precise 1.5x

For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", and dispatch to a numeric template parameter in the C++ code instead of a boolean toggle. People can add support for other values or make it more generic (such as directly parsing the env var digits) in the future.
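Running the same round-trip check from the earlier sketch over the four candidates confirms the list above (illustrative):

```python
import torch

def exact_in_e4m3(x: float) -> bool:
    return torch.tensor(x).to(torch.float8_e4m3fn).float().item() == x

for m6 in (448.0, 320.0, 288.0, 256.0):
    # 448 * 1.5 = 672 overflows; 320 * 1.5 = 480 and 288 * 1.5 = 432 have no
    # exact E4M3 encoding; only 256 * 1.5 = 384 round-trips exactly.
    print(int(m6), "->", m6 * 1.5, "exact:", exact_in_e4m3(m6 * 1.5))
```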

@zianglih (Contributor Author) replied:

NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.

Comment thread tests/pytorch/utils.py Outdated
Comment thread transformer_engine/common/cast/dispatch/quantize.cuh Outdated
A collaborator commented:

This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.

@zianglih (Contributor Author) replied:

Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.

@zianglih zianglih marked this pull request as draft May 12, 2026 02:01
@zianglih zianglih marked this pull request as ready for review May 12, 2026 06:45
@zianglih zianglih requested a review from timmoon10 May 12, 2026 06:47
@zianglih zianglih marked this pull request as draft May 12, 2026 09:03
@zianglih zianglih marked this pull request as ready for review May 12, 2026 10:10
```python
nvfp4_4over6 : {None, 'weights', 'activations', 'all'}, default = None
    Select tensors that use NVFP4 4over6. In this mode NVFP4
    quantization evaluates per-block map-to-4 and map-to-6 candidates
    and chooses the one with lower MSE. Ties choose map-to-6. The
```
A collaborator commented:

We need both MSE (better for post-training?) and MAE (better for pre-training as per our internal studies) to be supported, with MAE as the default.


@Oleg-Goncharov Oleg-Goncharov self-requested a review May 12, 2026 16:37
zianglih added 20 commits May 13, 2026 00:36
@negvet (Collaborator) commented May 13, 2026

What is the e2e step time increase with 4/6 on some typical workload?

zianglih added 2 commits May 13, 2026 02:36
@zianglih (Contributor Author) commented May 13, 2026

Major changes from last time:

  • Use a standalone quantization kernel implementation instead of folding into existing code. 4over6 quantize is heavily fp32 compute bound (#2972 (comment) and #2972 (comment)), and the latency-hiding techniques in TE's original NVFP4 quant kernels lead to higher register pressure and worse performance. There is not much we can do about the fp32 arithmetic bottleneck without changing heuristics. Even if we want to further optimize perf/heuristics, I think we should do it in a separate PR and extend it as new error modes. cc @Oleg-Goncharov @kwyss-nvidia
  • Allow both 448 and 256 configurations. The user can configure this by setting NVTE_NVFP4_4OVER6_E4M3_USE_256. However, the underlying implementation encodes nvfp4_e4m3_max and an E4M3_MAX template parameter instead of a boolean flag, so other values can easily be supported in the future. cc @timmoon10 @kwyss-nvidia @negvet
  • Add and default to the MAE error mode. cc @negvet
  • For the 4over6 quantize C++ test, we no longer check map-to-4 vs map-to-6 selection and accept either as bitwise exact. This avoids numerics drift across CPU architectures. The Python test still has strict candidate-selection coverage. cc @Oleg-Goncharov

@zianglih zianglih marked this pull request as ready for review May 13, 2026 09:48
@zianglih zianglih requested review from ksivaman and ptrendx as code owners May 13, 2026 09:48